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Abstract 

The efficacy of family-based approaches to mixture model-based clustering and classification depends 
on the selection of parsimonious models. Current wisdom suggests the Bayesian information criterion 
(BIC) for mixture model selection. However, the BIC has well-known limitations, including a tendency 
to overestimate the number of components as well as a proclivity for, often drastically, underestimating 
the number of components in higher dimensions. While the former problem might be soluble through 
merging components, the latter is impossible to mitigate in clustering and classification applications. In 
this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this problem. This approach is 
illustrated based on applications of extensions of mixtures of factor analyzers, where the LPBIC is used 
to select both the number of components and the number of latent factors. The LPBIC is shown to 
match or outperform the BIC in several situations. 



1 Introduction 

Consider n realizations (xi, X2, x„) of a p-dimensional random variable X that follows a G-component 
finite Gaussian mixture model. The likelihood is given by 



i=lg=l 



(1) 



where tt^ > 0, with X]g=i ""a ~ 1j ^^'^ mixing proportions, 0(x | fig, Sg) is multivariate Gaussian density with 
mean fig and covariance matrix S^, and 1? — (tti, . . . , ttq, fi^, . . . , fiQ, Si, . . . , Sq). A model-based clustering 
approach assumes that each component or some combination of components corresponds to a cluster. When 



fitting the model in (|1[), the main task is to decide the number of components G. Titterington et al. ( 1985 ) 



McLachan and Basford (1988) and McLachan and Peel (2002) extensively reviewed mixture models, with 



a focus on Gaussian mixture models. Fraley and Raftery (2002) presented a review of work on Gaussian 



mixtures with a focus on clustering, discriminant analysis, and density estimation. They discuss a family 
of Gaussian mixture models, which arises from the imposition of constraints upon an eigen-decomposition 
of the component covariance structure. The family of mixture models they discuss, known as MCLUST, is 



actually a subset of the Gaussian parsimonious clustering models (GPCMs) of Celeux and Govaert (1995). 
When using the MCLUST models, one must choose the appropriate member of the family, i.e., the covariance 
structure, in addition to deciding the number of components G. 



Ghahramani and Hinton (1997) introduced a mixture of factor analyzers model, which was further de- 
veloped by Tipping and Bishop (1999) and McLachlan and Peel (2000). Through foisting constraints on 



the covariance structure, McNicholas and Murphy^ ( ,2008^ ,2010 ) develop mixtures of factor analyzers into a 
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family of parsimonious Gaussian mixture models (PGMMs). Now, in addition to selecting the member of 
the family (i.e., the covariance structure) and the number of components, one must also select the number 
of latent factors. Further complicating the model selection problem here is the fact that PGMMs are often 



applied to high-dimensional data. McNicholas et al. (2010) explain why the PGMMs are particularly suited 



to the analysis of high-dimensional data: amongst the most salient points is the fact that, unlike families 
like MCLUST, the number of covariance parameters is linear in data dimensionality for every member of 
the PGMM family. 

There are a number of well-known methods to select the best mixture model but the BIG remains by 
far-and-away the most popular. We have 



BIG = 21og/:(i9 I x) -plogn. 



(2) 



where i9 is the MLE of i9, £ is the likelihood, p is the number of free parameters and n is the number of 
observations. For a family of mixture models, the model having the maximum BIG is selected. The use of 
BIG is theoretically justified by a number of authors, e.g., Kass and Wasserman (1995), Kass and Raftery 



1995 


), and 


Keirbin 


(2000 



In particular, the BIG has some useful asymptotic properties, e.g., the criterion 



consistently chooses the right model under an increasing number of observations (Shibata 1986). 

Nevertheless, the BIG is not without drawbacks. The criterion is derived using a Laplace approximation 
and its precision is influenced by the specific form of the prior density of the parameters as well as the 
correlation structure between observations. Recently, Glyde et al. (20071 have rectified the problems of 
the marginal distribution of the parameter, caused by the Laplace approximation. In addition, |Fraley| 
and Raftery ( 2007[ ) proposed a Bayesian regularization for Gaussian mixtures. Their method assumes pre- 
defined priors that lead to a modified version of the BIG, using posterior modes instead of the maximum 
likelihood estimates (MLEs) of the parameters. The resulting method avoids degeneracies, singularities, and 
the problem of flat priors. However, another more serious problem has not been addressed, i.e., the problem 
of high-dimensional cases. 

The penalty term in the BIG is plogn, cf. ([2|. Therefore, in a high-dimensional setting, where p 3> n, 
the penalty term dominates the likelihood and so the BIG is prone to under fitting. Parametric estimation 
for high-dimensional cases has been studied by a number of authors, mostly within the linear regression 
set-up. The celebrated LASSO method (Tibshirani 1996) is perhaps the most popular among them. This 



method minimizes the residual sum of squares under the constraint that the sum of the absolute values of the 
regression coefficients is less than some constant, leading to sparse solutions of the coefficients and thus an 
interpretable model. In the following years, different variations of the LASSO have been proposed depending 



on the nature of regression and asymptotic behaviour. Some of them are the adaptive LASSO ( Zou 2006 ) 



the fused LASSO ( Tibshirani et al 



2008). 



Fan and Li 



2005), and the graphical LASSO (Friedman et al. 
( |2001[ ) provided a theoretical discussion of variable selection via a non-concave penalized likelihood procedure 
where the LASSO is a special case. They also proposed that a good penalized estimation should satisfy the 
oracle properties, i.e., it should be consistent and the estimates should be asymptotically Gaussian. 



Following the idea of Fan and Li (2001 1, Khalili and Ghen (20071 were the first to propose the use of the 



penalized likelihood in finite mixture of regression models, where the penalty is non-concave LASSO being 
a special case. They also devised a method of selecting the tuning parameter as well as conditions under 
which the estimation procedure would satisfy the oracle properties. Their method is especially suitable for 
finite mixtures of regression models, though no new model selection criterion was proposed. It should also 
be noted that the theoretical results regarding the asymptotic properties were somehow strange, because 
the authors used the same tuning parameter comparing two different estimates for a fixed cluster. Ghen 



and Ghen (2008) proposed an extended BIG for regression in high-dimensional setting. The extended BIG 
assumes a prior inversely proportional to the size of the assumed model instead of a flat prior. The criterion 
is consistent and computationally cheap. Interestingly, the authors did not propose any penalized likelihood 
here, instead they maximized the natural likelihood, thus using the conventional estimation procedure. The 
above estimation procedures, though interesting and useful, are mainly for regression-type problems, and 



2 



not applicable to mixture model-based clustering and classification. Also, as the authors rightly pointed out, 
the approach is computationally infeasible if p » n. Nevertheless, useful extensions can be possible. Herein, 



we draw upon some mathematical results from Fan and Li (2001) and Khalili and Chen (20071, especially 



on the issues of the choice of penalty and consistency. 

The use of penalized likelihood in mixture model-based clustering has been proposed by |Pan and Shen| 
(2007), where a LASSO-type penalty is applied to the likelihood. From there, they went on to propose 
a modified BIC which would be well-suited for high-dimensional settings. The limitation of that method 
is that this criterion works only for a common, diagonal component covariance matrix. Furthermore, the 
authors did not study the asymptotic properties, which are important in the sense that the classical LASSO 
method can be inconsistent (cf. |Zou 2006). An ideal criterion should be analytically derivable from the 
penalized likelihood, work well for an arbitrary model, and have some good asymptotic properties. The 
work presented herein attempts to address these requirements by proposing LASSO-penalized BIC (LPBIC) 
for model selection within high-dimensional setting for the PGMM family. 

While deriving the MLE of the unknown parameters, we use a penalized likelihood approach. In partic- 
ular, instead of maximizing the likelihood | x), we maximize the penalized log-likelihood 



logCi'd I x) - ^ TTg ^ ipi^igj] 



9=1 J = l 



We use a LASSO-like penalty for ip{figj). In particular, ip{figj) = nA„|/Xgj|, where figj is the jth element in 
fig and A„ is the tuning parameter that depends on n. Though a LASSO penalty is used here, other types of 

For example, one might use the HARD pen alty (p{figj) = [A^ — 

hoOll. One problem with 



non-concave penalties can also be suitable. 



Fan and Li 



{y/n^gj — A„) / {y/n^gj < Aji)] or the SCAD penalty, as discussed by 
using such an Li-norm penalty is that the oracle properties might not be satisfied fully: the estimation can be 
consistent but not asymptotically normal. HARD or SCAD penalties satisfy both these properties and these 
issues are discussed in more detail in Section [3j Still, however, we prefer the LASSO-type penalty because 
it is computationally easier due to its convexity. From this penalized likelihood, we derive a model selection 
criterion. We use a modified AECM algorithm (|McLachan and Peel 2002 1 to estimate the parameters in 



the PGMM models. We show that in high-dimensional settings, our LPBIC generally outperforms the BIC 
for the PGMM family. 

The remainder of this paper is laid out as follows. In Section [2] we discuss parameter estimation under 
the penalized likelihood approach and derive an LPBIC. The asymptotic properties of LPBIC are discussed 
(Section |3| and we illustrate our approach on real and simulated data (Section |4|. The real data considered 
exhibit the 'small n, large p' property and our data analysis results are compared with the BIC. The paper 
concludes with a discussion (Sectionjsj), while the mathematical derivation of LPBIC as well as its asymptotic 
properties are discussed in appendices. 



2 Method 

Again, suppose we observe x = (xi, X2, x„) with /(x | i9) = Y.g^i'^g't'i^ I /^gi^s): where 0(x | Atg,Sg) 
is multivariate Gaussian density with mean fig and covariance matrix Sg. Now, instead of maximizing the 
likelihood | x), we maximize the penalized log-likelihood 

G p 

l0g£pen(l? I x) = l0g£(l? I x) - nXn^TTg^lfigjl, (3) 

ff=l 3 = 1 
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where /Xj, and A„ are defined as before. Hereafter, we denote ip{fi) = ""a X]j=i 'fil^gj) 

G p 

log£pcn('»9 I x) = log/:(i9 I x) - nA„ ^T^g^ vifJ-g]) = log£(i? | x) - nX„>f{n). 

g=l j = l 

Before going into details of parameter estimation, we make two assumptions. Firstly, as we can observe, 
the penalty function is non-concave and singular at the origin; it does not have second derivative at 0. We 



locally approximate the penalty by a quadratic function as suggested by Fan and Li (2001 1. The parameters 
are estimated by successive iterations. Suppose /m^™) is the estimate of of /i after m iterations. The penalty 
can be locally approximated as 

9=1 i=i '"gj" 



where pg is the number of non-zero elements in fig. We assume that the marginal distribution of the mixing 
proportions (tti, 7r2, tt^) is uniform on the simplex and that fjig ^ M{iJig,I{jjig)^^), for g = 1,2,...,G, 
where fig is the MLE derived by maximizing the penalized likelihood £pon and lip-g) is the unit information 
matrix at fig. 

To estimate the parameters, we use the Alternating Expectation Conditional Maximization (AECM) 
algorithm. There are two stages of the algorithm. At the first stage of the algorithm, when estimating tt^ 
and fig, we define = {zn, . . . ,ZiG) to be indicator variables showing the component membership of the 
ith observation so that Zig = 1 if belongs to the gth component and Zig = otherwise. Zi is treated as 
the missing data at the first stage. Hence the expected complete data log-likelihood is 

n G n G 

^^^Z^glOgTTg + ^^Z,g log {(f>{x^ \ - (p{fl), 

i—1 g—1 i—1 g—1 

where Zig — %g4>{'x.i \ fig^Yig) /^'^^^TTj(j){'x.i \ fig,'Sg). The M-step maximizes Q to update the parameter 
estimates tt^ and fig. The estimation of tt^ is complicated and has a complex analytic form. However, we 
have observed that in practical applications, the analytical estimate is equivalent to the estimate derived by 
the EM algorithm. Hence, in our analyses (Section |4|, tt^ can be estimated via 



En 

^9 = 

n 

For the mean parameters. 



^^^g 

Hence 



= ^ %(x, - fig) - nX„TrgSign{fig). 



f^gj 



sign(/igi) 



\fLgj\ - A„ Sgl 



+ 



if S„l > 0, 
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^figj otherwise. 



where figj — J2^=i ^ig^ig/ J27=i ^ig update of ^gj if no penalty term were involved, 1 is the vector 

with every element equal to 1, and for any a, a+ = a if a > and a+ = otherwise, figj is a shrunken 
estimate of figj in the sense that figj = if (Sgl)j > and A„ > //gj/(Sgl)j. Otherwise, figj is obtained by 
shrinking the usual EM estimate figj by the amount A„(Sgl)j towards 0. 
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At the second stage of the AECM algorithm, we take the missing data as the group labels Zj and 
the unobserved latent factors u to estimate the variance-covariance matrix under the PGMM set-up. The 
component covariance matrices Si, ... , So are updated as usual, depending on the family of models used; 
seelMcNicholas and Murphyl (120081 120101) for details in the case of the PGMM family. The first stage, where 



the fig and are estimated based on the complete data (x, z), and the second stage, where the constituent 
parts of the Sg are estimated based on the complete data (x, z, u), are iterated until convergence. Extensive 
details on an AECM algorithm for fitting the members of the PGMM family are given by |McLachlan and 
Peel| ( |2000[ ) and |McNicholas et al.| (20101 ). 

To derive a model selection criterion from the penalized log-likelihood, we maximize ^. Using the 
second term of ^ becomes 



~G 



G Pg 

EE 

s=i i=i 



I Mm I 



1 sign(/i 

2 flgj 



9]) f ..2 



-2 \ 
M«) 



where Pg is the number of non-zero mean components in class g. Here we make an assumption that for a given 
model, the mixture components are chosen independently so that the parameters for any two clusters are 
independent. Hence, using the Weak Law of large Numbers with the BIC-type approximation to log£(i? | x), 
the penalized BIG is 



LPBIG = 21og£(i9 I x) -plogn 



2n\ 



G Pg 

^EE 

9=1 J=l 



I Mail 



I Am I 



sign {fig 



(5) 



where p is the number of estimated parameters which are non-zero. Intuitively, the LPBIG further penalizes 
the traditional BIG by both aboslute mean and absolute coefficient of variation of the parameters. The 
derivation is discussed in detail in Appendix [K\ 



3 Asymptotic Properties 
3.1 Properties 

The consistency of a model selection criterion is closely related to the asymptotic identifiability of the 
model. In general, a model Q with the the parameter set is called identifiable if, for any two different sets 
of parameters i9i and i92, 

g{^i) = gi'd2) =^ ^1 = ^2- 

We assume that our model satisfies the asymptotic identifiability condition. In the context of mixture models, 
a criterion is consistent if it can correctly select the number of components and the true set of parameters. 
If the true parameter set t^q is decomposed as (i9oi; '^02) such that i?02 contains only the zero elements, and 
if any estimated parameter 1? that is sufficiently close to i?o is likewise decomposed as ■^2), then in order 
to satisfy consistency, we should have P(i92 — 0) — > 1 as n — > 00 and i9i — > i9oi in probability. Thus, 
the criterion should choose as it would if the true number of clusters and the true parameters were known. 
Based on this idea, we study the consistency of LPBIG with the help of the following assumptions: 

I Let p = O (n") and A„ = o (logn/n). Define an estimate i9 of i9 be such that 1 1 1? — i^o 1 1= {n'^) for 
K > —00. 

II Let 1? = {9i, 6*2, 0^). Then there exist finite real numbers Mi and M2 (possibly depending on k) such 
that 

dlogCi-d I x) 



< ^/i(x) and sup 



\ogC{^ I x) 



de.dek 



< M2(x) 
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Ill is positive-definite for all i9. 



Then, under assumptions |T] to |III[ and assuming that the asymptotic identifiability condition is satisfied, we 
state the following theorem. The proof is given in Appendix [Bj 

Theorem If k < min [0, (a — l)/2], then the LPBIC chooses the number of components and set of pa- 
rameters as it would choose if i?o were known as n — >■ oo. In other words, under the condition k < 
min (0, a - 1/2), if there exists an estimate ^ such that \\'d-'do\\=0 (n") and LPBIC(^) > LPBIC(i9) for 
ah 1? such that \\-d - 'do\ \ = O {n"), then 



Part a P(i92 =0) — > 1 as n 



and 



Part b i?i — > i9oi in probability as n — > oo. 



We prove only Part a with some of the arguments proposed by Khalili and Chen (2007) for an FMR 



setting. The method is modified for mixture models with high-dimensional set-up. Part b of the theorem can 
be proved exactly by the method described in Fan and Li (2001). To prove Part b, we need y/nXn — 0, as 
n ^ oo which is satisfied by Assumption |l] This is particularly important because LASSO-type penalties do 
not satisfy the oracle property, i.e., they do not ensure that a -yn-consistent MLE of 9 exists which satisfies 
Part a and Part b. This is because the existence of a -yri-consistent MLE requires that y/nXn — > oo and 
the consistency of ??i needs that y/nXn — > 0. Hence, under a tighter assumption, we show that if such an 
estimator exists, then it satisfies consistency. Other non-concave penalties like SCAD or HARD, however, 
can satisfy the oracle property with a proper choice of the tuning parameter. 



3.2 Choice of A„ 



Generally the tuning parameters are chosen by cross-validation ( Stone 1974 ) or generalized cross-validation 
(Craven and Wahaba 1979). We should remember that A„ depends on n. To satisfy the asymptotic 



properties, we require A = o(logn/n). Khalili and Chen (20071 derived a component-wise deviance-based 
GCV with the above conditions in order to estimate A. The method, though originally used in regression, 
also serves well for mixture models. The present paper takes the working sequence A„ = 1/p and studies 
the behaviour of the LPBIC. The methods proposed by Khalili and Chen (2007), modified for a mixture 



model, are also considered and provide a range for the values of A„. It is observed that for moderately large 
n {n > 50), A„ = 1/p falls into that range. For our data analysis (Section |4]), we studied the behaviour of 
LPBIC for different values of A„ within that range. For illustration, though, a single A„ is chosen because 
the behaviour of the LPBIC is uniform over different A„ values within that range. 



4 Data Analysis 



4.1 Overview 

We analyze two data sets and compare the results using the PBIC to those with the BIC for the PGMM 
family. The first one is a high-dimensional simulated data set and the second one is a real high-dimensional 
data set. Although run as cluster analyses, the true group memberships are known in each case and we use 
the adjusted Rand index (ARI: Rand 1971 Hubert and Arable 1985 ) to refiect classification agreement. A 



value of 1 indicates perfect agreement and a value of would be expected under random classification. 



4.2 Simulated Data 

We generate a simulated p-dimensional Gaussian data set consisting of three groups. We set fi^ — —5.51, Si 
isotropic; ^ 21, S2 diagonal; and /Xg = 31, S3 full, with rii = 40, n2 = 30, = 30. We ran simulations 
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for p e {100, 250, 500}. LPBIC values are observed for each member of the PGMM family for G = 1, . . . , 4 
and q = 1,2,3. The results (Table [T]) show that the PBIC consistently chooses G = 3 as p gets larger but 
that the BIG fails in higher dimensions, choosing a G = 2 component model. The associated ART values 
(Table [T]) confirm that the models selected by the PBIC capture the underlying group structure better than 
those chosen by the BIC, especially in higher dimensions. 



Table 1: Best model chosen by PBIC and BIC for high-dimensional simulated data. 



p= 100 
p = 250 
p = 500 



3 
3 
3 



LPBIC 
Model 

cue 
cue 
cue 



"ART 

0.82 
0.97 



G 
3 
2 
2 



BIC 
Model 

cue 
ccc 
ccc 



"ART 

0.62 
0.49 



The effect of increasing dimension on the performance of the BIC is clear: the BIC chooses fewer mixture 
components and latent factors, as well as a more parsimonious covariance structure. The LPBIC, however, 
chooses the same number of components and the same covariance structure each time, and the number of 
factors does not decrease with p. 

Next, we generate 25 simulations of the p — 500 dimensional data and study the behaviour of BIC and 
LPBIC for selecting G and for clustering performance (i.e., ARI). The results (Figure [l]) show that LPBIC 
correctly chooses the number of components (G = 3) 23 times but the BIC only selects G = 3 four times 
out of 25. As expected, the BIC tends to choose too few components. The ARls for models selected using 
the LPBIC are higher than those selected using the BIC. Out of 25 simulations, the ARI with the LPBIC is 
higher than that for the BIC in 21 cases, illustrating generally superior clustering performance. 



^3 S - ' ' 





Figure 1: Plot of the performance of LPBIC and BIC for 25 simulations. The left-hand plot shows the 
selection of number of components by the BIC and LPBIC. The right-hand plot shows the ARIs of the 
models selected by LPBIC and BIC. 
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4.3 Leukaemia data 



Golub ( 1999 ) presented data on two forms of acute leukaemia: acute lymphoblastic leukaemia (ALL) and 



acute myeloid leukaemia (AML). Affymetrix arrays were used to collect measurements for 7,129 genes on 72 
tissues. There were a total of 47 ALL tissues and 25 with AML. McLachlan et al. (2002) reduced the data 
set as follows: 

1. Genes with expression falling outside the interval (100, 16000) are removed. 

2. Genes with expression satisfying max/min < 5 or max-min < 500 are removed. 



software (cf. McLachlan et al. 



McNicholas and Murphy (|2010| further reduced the number of genes to 2,030 by applying the select-genes 

We analyze these 2,030 genes using 20 different random starts for the 
6. 



2002) . 



initial Zig. We run our approach for G G {1,2} and q 



Table 2: Comparison of the performance of LPBIC and BIG for PGMM model selection for the leukaemia 
data. 





Value 


G 


q 


Model 


ARI 


BIG 


-400394 


1 


2 


GGU 


0.29 


LPBIG 


-391023 


2 


1 


GUG 


0.47 



Summaries of the models selected by the LPBIG and the BIG, respectively, are given in Table [2j The 
BIG chooses a GGU model with G ~ 1 component and q — 2 factors. The LPBIG chooses a GUG model 
with G — 2 components and q — I factors. The ARI of the model chosen using LPBIG (0.47) is greater than 
that for the model chosen using the BIG (0.29). The model selected using the LPBIG misclassifies eleven of 
the 72 samples (Table [s]). 



Table 3: Glassification table of the best model chosen by LPBIG. 





1 


2 


ALL 


39 


3 


AML 


8 


22 



5 Discussion 



The paper proposes a LPBIG through a penalized likelihood-based approach in the context of parsimonious 
Gaussian mixture model selection. The approach is mainly intended for the high-dimensional setting, where 
the BIG has some unattractive problems due to an 'exploding' penalty term for high-dimensional data. Our 
LPBIC approach does not use the total number of independent parameters to be estimated in its penalty 
term but, rather, the total number of independent non-zero parameters to be estimated. This has some 
advantages. Because the likelihood is penalized by a tuning parameter, many of the mean components 
become 0, thereby reducing the number of independent estimable parameters. The loss of information due 
to penalizing the likelihood is somehow compensated for by both absolute mean and absolute coefHcient of 
variation of the mean parameters. 

The choice of tuning parameters is an important aspect in this scenario because no theoretical result exists 



which specifies the best choice. Recently, Wang et al. (2007 2009) proposed some interesting mathematical 
methods of choosing the tuning parameters without requiring cross-validation. However, their method is 
most suitable in low-dimensional settings. Herein, we followed an approach close to the one proposed by 



Fan and Li (2001), though careful modifications have been taken to preserve the asymptotic properties. 



accounting for the nature of the data. 
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Our method seems consistent in choosing the right number of clusters for high-dimensional data, as shown 
through the analysis of real and simulated data. Our analyses suggest that the LPBIC is an improvement 
over the BIG in the high-dimensional setting. What we lose is the oracle property, because the LASSO 
may fail to satisfy the consistency, sparsity and asymptotic normality all at the same time. But the LASSO 
has some computational advantages because of convexity and hence it is preferred over other non-concave 
penalties. 

Of course, the LPBIC is not without its issues. One problem arises by locally approximating the penalty 
function: if an estimator is shrunken, it stays at 0. Another arises if the initial domain of the estimates 
docs not contain the posterior mode, or even if the posterior mode lies at the boundary of the domain. This 
second problem, which will lead to failure, is a general problem with the EM algorithm. 

Future work will focus on the use of penalties that lead to consistent model selection criteria. We are in 
particular interested in the adaptive LASSO which leads to the oracle properties. We shall also study the 
penalization of the variance parameters as it will generate greater parsimony. 
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A Derivation of LPBIC 



To derive the LPBIC, we closely follow the derivation of the usual BIC. We have to maximize (|3|. Using 
Q, the second term becomes 



3=1 



nA„E / %E l^wl^^f 



G 



G Pg 

EE 

9=1 i=i 



1 signifigj , 2 



I'-gj 



il^ai - Ao,) 



nX 

~G 



G Pg 



/J2^=iPg- Thus the 



where pg is the number of non-zero mean components in class g. Under the assumption made in Section [21 
fig is at most pg dependent, and the Weak Law of Large Numbers holds. In a large-p setting, X]g=i-Ps 

a large number and so 'Z'^^i Ei=i (a^L - AL) Ejli (-^(A 

second term becomes 

"^EE 

k=ij=i 

The first term, using Taylor's expansion, is 



(^(A.)-^) 



n 



ll^gjl 



exp[log/:(i9 I ^)g{-d)]d& 



exp 



log c{'d\^)g I'd) + {'d-'d) 



di9, 



where Ji is the second derivative matrix of log£(i?)tJ(i?). Because i9 is derived maximizing the penalized 
likelihood, the second term within the integral becomes (i9 — i?)9(p„(/i.)/9i?, where fnil^-) is the LASSO 
penalty function. Using (W|), the mean- value theorem and the fact that the i9 values are close to the 



second term within the integral is ri\n/GJ2g=i J2j=i sigii(Mgi)- 

The third term within the integral similarly becomes l/2(i? — i?)''H^(')? — 19), where i9 is the set of non-zero 

parameters and i? is their estimate. Using Laplace approximation on H and applying the Weak Law of Large 

Numbers, as in the usual BIC, we arrive at log£('i? | x) — l/2plogr7,, where p = dim(i9). This, combined 
with the second term of ([s]), gives ([s]). 
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B Proof of the Asymptotic Property of LPBIC 



First, suppose the true number of clusters G is known with the corresponding parameter i9. Let the true 
parameter be i?o- Let i9 be an arbitrary estimate of Let po and pi be the corresponding number of 
non-zero parameters and and Xn"^ be the corresponding tuning parameters. We first prove that, for an 
arbitrary estimate ^ satisfying \\ ^ - 'd^ \\= O {n'^), LPBIC(i^i, 1^2) - LPBIC(i9i, 0) < as n — > 00. We 
note that 

LPBIC(i9i, 1^2) - LPBIC (1^1,0) = 201,^2 \ x) - 2/(i^i,0 | x) - [a(i^i, ^2) - A(i^i, 0] 
where I — log£ and A is the penalty part of LPBIC. Using the mean- value theorem, 



?(i9i,i92|x)-Z(i9i,0|x) 



^2 



where || ^ ||<|| i92 ||= O (n'") . Also, 



< 



d&2 



d^2 



dl{^i,0) dl{'do,0) 



d^2 



<j2M2{zd \m\ 



i=l 



d^2 



(6) 



O (n) = O {n^+^) 



from Assumption |nj Also, from the last part of the first line of ([6]), which is of order O , we can con- 

clude that dl {'do, 0) /9i?2 is of order O (n'^'^^) , as is ^) /9i?2- Therefore, from these order assessments, 

we conclude that 

G p 

^(^1,^2) -/(^i,o) =0(n«+i)^ ^ fig,, 



g=lj=Pg+l 



where pg is defined as in Q . 

For the part A(i9i,i?2) - A(i9i,0), note that 



2nX 



G 



G Pk 

3=1 i=i 



33 



sign {jxgj) 



O (n"+i) A„ 



because the summation part is some constant times p — 0{n^)^ using Assumption [l| We also have (pi — 

G 

Po) log n = Y.g=i (P ^ Pg) log n = O (n°) log n. Hence, 

G p 

LPBIC(i?i, i92) - LPBIC(i?i, 0) = C'(n'^+i) ^ fig, - 0{n°') \ogn- {X^^^ - Xl^^)0{n°'+^). 

g = l j=pg + l 

The first term of the above expression is O (n''+^) S^=p +1 Agj = ^ (n^''+^). Using Assumption |lj 

i.e., that A„ = o(logn/n), and by order comparison, we can conclude that the leading terms in the above 

expression are O (n^^+i) and O (n") logn. Because a > 2k + 1, LPBIC (i^i, 192) - LPBIC 0^ < as 
n — > 00. 

Now, let ~ {^\, 1^2^ be an estimate of '& such that (■^i, 0) is a maximizer of LPBIC('i?i, 0) satisfying 
II 1^ - i9o Ih O It suffices to show that in the neighbourhood || i9 - i?o lh O (n"), LPBIC i92) - 

LPBIC 



(^1,0) 



< with probability tending to 1 as n — !• 00. We note that 



LPBIC(i9i,i92) -LPBIC(i^i,0) = [LPBIC(i9i,i92) -LPBIC(i9i,0)] -f [LPBIC(i9i, 0) - LPBIC(i9i, 0) 
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where LPBIC('(?i, i?2) — LPBIC(i?i,0) < with probabihty tending to 1 (by the previous result) and 
LPBIC(t?i, 0)-LPBIC(^i, 0) < with probability tending to 1 since (^i, 0) is a maximizer of LPBIC(i?i, 0). 
Thus (i?i,0) maximizes LPBIC(i?i, 1^2) with probability tending to 1 as n — >■ 00. Hence we conclude that 
P ^1^2 = 0^ — > 1 as n — > 00. Hence the proof. 

The case of unknown clusters can be similarly proved. If the estimated number of components is Gi and 
the true number is G, then the estimated parameter corresponding to Gi is, say, We can again decompose 
as (■^1,^2) and similarly show that '02 — > in probability. Here ■^i comprises of the clusters belonging 
to 1?o- 
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